R Advanced Course

Machine learning algorithms in R - I

Objective and contents

Objective and contents

Objective: to work with machine learning algorithms in R for spatio-temporal prediction of variables.

  • Introduction to machine learning algorithms
  • Machine learning packages in R
  • Random Forest
  • Support Vector Machines
  • Neural Networks
  • Using machine learning for prediction
  • Example - machine learning for streamflow prediction

Introduction to machine learning algorithms

Machine Learning

Machine learning is a branch of Artificial Intelligence that uses data and algorithms to imitate the way humans learn (IBM), gradually improving its performance on a given task.

Machine Learning

Machine learning algorithms have been widely used:

  • Virtual assistants
  • Video surveillance
  • Social media
  • Pattern recognition
  • Spatial data analysis
  • Risk assessment

Machine Learning

Machine learning has proven to be useful in many fields. What are the steps to apply machine learning methods?

  • Data collection
  • Data preparation and pre-processing
  • Model(s) selection
  • Model(s) training
  • Performance evaluation
  • Enhance model(s) performance

Which machine learning algorithm to choose?

There is no single best-performing machine learning algorithm: performance depends on the type of data and on the final objectives. Broadly, machine learning algorithms fall into two main groups: supervised and unsupervised learning.

Machine learning algorithms

The following algorithms are often used for supervised learning:

  • Nearest Neighbour
  • Random Forest (RF)
  • Support Vector Machines (SVM)
  • Neural Networks (NN)
  • naive Bayes

Machine learning algorithms

The following algorithms are often used for unsupervised learning:

  • k-means clustering
  • Principal Component Analysis
  • Singular Value Decomposition
  • Independent Component Analysis

Machine learning packages in R

Machine learning packages in R

In this course, we will focus mainly on supervised learning algorithms as we want to work with regression and classification of variables. There are several R packages related to machine learning:

  • e1071: fuzzy clustering, SVM, naive Bayes, and K-NN
  • kernlab: kernel-based machine learning methods for classification, regression, clustering, novelty detection, quantile regression and dimensionality reduction, SVM, Spectral Clustering, and PCA
  • caret: set of functions that seeks to streamline the method for creating predictive models

Machine learning packages in R

  • DataExplorer: automated data exploration process for analytic tasks and predictive modeling, so that users could focus on understanding data and extracting insights
  • superml: provide a standard interface to users who use both R and Python for building machine learning models
  • mlr3: efficient, object-oriented programming on the building blocks of machine learning
  • party: conditional inference trees
  • rpart: recursive partitioning for classification, regression and survival trees
  • neuralnet: neural networks using backpropagation, resilient backpropagation with or without weight backtracking
  • ranger: a fast implementation of Random Forests, particularly suited for high dimensional data

Random Forest

Random Forest

Random Forest (RF; Biau and Scornet, 2016, Breiman, 2001, Prasad et al., 2006) is an ensemble learning method that can be used for supervised classification and regression tasks by constructing numerous decision trees using the relationship between independent and dependent variables.

  • It is recognised for being accurate and able to deal with small sample sizes and high-dimensional feature spaces.
  • Performs well even when some explanatory variables do not add information to the prediction and when several covariates are used.
  • Is less prone to overfitting and to biased estimates than a single decision tree.

Random Forest

\[ \hat{\theta}^B(x) = \frac{1}{B} \sum_{b=1}^B{t^*_b(x)} \]

\(\hat{\theta}^B(x)\) is the final prediction

\(b\) is the individual bootstrap sample

\(B\) is the total number of trees

\(t^*_b\) is the individual decision tree
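The ensemble average above can be sketched in base R with made-up tree predictions (the values are purely illustrative):

```r
# Illustrative values: suppose B = 3 trees return these predictions
# t*_b(x) for a single input x
tree_predictions <- c(110, 118, 126)

B <- length(tree_predictions)
theta_hat <- sum(tree_predictions) / B   # the ensemble average
print(theta_hat)   # 118
```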

Random Forest

Support Vector Machines

Support Vector Machines

Support Vector Machines (SVM; Hearst et al., 1998, Steinwart et al., 2008) aim to find a hyperplane in an N-dimensional space that distinctly classifies the given data points. As many such hyperplanes exist, SVM selects the one that maximises the margin, i.e. the distance between the hyperplane and the nearest data points of each class.
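As a toy one-dimensional sketch (with made-up values, not a real SVM fit), the maximum-margin idea reduces to placing the boundary midway between the closest points of the two classes:

```r
# Toy 1-D sketch: for two separable classes, the maximum-margin
# boundary lies midway between the closest points of each class
# (the support vectors); all values are illustrative
class_a <- c(1.0, 1.5, 2.0)
class_b <- c(4.0, 4.5, 5.0)

sv_a <- max(class_a)            # support vector of class A: 2.0
sv_b <- min(class_b)            # support vector of class B: 4.0

boundary <- (sv_a + sv_b) / 2   # 3.0
margin   <- (sv_b - sv_a) / 2   # 1.0, the distance SVM maximises
print(c(boundary = boundary, margin = margin))
```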

Neural Networks

Neural Networks

Neural Networks (NN), also known as Artificial Neural Networks, are composed of node layers: an input layer, one or more hidden layers, and an output layer. Each node connects to others and has an associated weight and threshold (IBM). If the output of a given node is above the specified threshold, the node is activated and sends information to the next layer.
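A single node can be sketched in base R as a weighted sum plus a bias, passed through an activation function (the weights, inputs, and threshold below are illustrative):

```r
# Minimal sketch of a single node: weighted sum of the inputs plus
# a bias, passed through an activation (all values are made up)
inputs  <- c(0.5, 0.3)
weights <- c(0.8, 0.4)
bias    <- -0.2

z <- sum(inputs * weights) + bias   # 0.5*0.8 + 0.3*0.4 - 0.2 = 0.32
activation <- 1 / (1 + exp(-z))     # logistic activation, about 0.58

# the node passes information on only if it exceeds the threshold
threshold <- 0.5
fires <- activation > threshold
print(fires)   # TRUE
```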

Neural Networks

Using machine learning for prediction

Using machine learning for prediction

Now that we have an overview of different machine learning algorithms, let’s use them for a regression exercise. The mtcars dataset (included in R) comprises fuel consumption and 10 aspects of automobile design and performance for 32 cars.

print(mtcars)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Using machine learning for prediction

We want to predict the horsepower (hp) of some cars using the other parameters of the cars:

  • mpg: Miles/(US) gallon
  • cyl: Number of cylinders
  • disp: Displacement (cu.in.)
  • drat: Rear axle ratio
  • wt: Weight (1000 lbs)
  • qsec: 1/4 mile time
  • vs: Engine (0 = V-shaped, 1 = straight)
  • am: Transmission (0 = automatic, 1 = manual)
  • gear: Number of forward gears

Using machine learning for prediction

In this exercise we will use Random Forest, Support Vector Machines, and Neural Networks, so let’s load the corresponding packages:

library(randomForest)
library(e1071)
library(neuralnet)

To predict the hp of the cars, let’s randomly split the data into a training sample (to train the machine learning algorithms) and a validation sample (to evaluate their performance). For reproducibility, we will set a seed in the code: setting a seed initialises R’s pseudorandom number generator so that the same random draws are produced on every run.

set.seed(28)
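A quick base-R check of what the seed does:

```r
# With the same seed, the same "random" draws are produced every time
set.seed(28)
draw1 <- sample(1:100, 5)

set.seed(28)
draw2 <- sample(1:100, 5)

identical(draw1, draw2)   # TRUE
```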

Using machine learning for prediction

For separating the data, let’s store the total number of cars in an object.

n <- nrow(mtcars)

After this, we can generate a random sample of positions. Let’s use 80% of the total number of cars for training purposes.

pos_training <- sample(1:nrow(mtcars), round(n*.8))
print(pos_training)
##  [1] 17 32  9 31  2 14 11  1 19  8  3 25 16 20 22 12 30 18  7 24 27  6 29 23 28
## [26]  5

Using machine learning for prediction

Now, we can separate the mtcars dataset into a training set and a validation set.

training   <- mtcars[pos_training,]
validation <- mtcars[-pos_training,]
print(training)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
print(validation)
##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Toyota Corona      21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9          27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1

Random Forest

The randomForest package implements Breiman’s random forest algorithm for classification and regression.

randomForest(formula, data, ntree = 500, mtry, nodesize, maxnodes, na.action)

formula: a formula describing the model to be fitted (the function also accepts a data frame or a matrix of predictors via its x argument)

data: a data frame containing the variables in the model

ntree: Number of trees to grow

mtry: Number of variables randomly sampled as candidates at each split

Random Forest

randomForest(formula, data, ntree = 500, mtry, nodesize, maxnodes, na.action)

nodesize: Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time)

maxnodes: Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize)

na.action: A function to specify the action to be taken if NAs are found

There are more parameters related to this function. Please type ?randomForest to see them.

Random Forest

In this example, we will use the default values of the function. The formula that will be applied is \(hp\sim.\). This means that \(hp\) is the dependent variable and the \(.\) means that all the remaining variables in the training dataset will be used as predictors.
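To see what the \(.\) expands to, we can inspect the formula with base R (here on the full mtcars data):

```r
# The "." in hp ~ . expands to every other column of the data;
# terms() from base R lists the resulting predictor names
f <- hp ~ .
predictors <- attr(terms(f, data = mtcars), "term.labels")
print(predictors)
# "mpg" "cyl" "disp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
```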

rf_model <- randomForest(hp ~ ., data = training)

If we want to use only particular covariates, we can specify them explicitly:

randomForest(hp ~ mpg + cyl + disp, data = training)

Random Forest

We can print the rf_model object.

print(rf_model)
## 
## Call:
##  randomForest(formula = hp ~ ., data = training) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 1322.719
##                     % Var explained: 73.54

Random Forest

To predict the hp values from the validation set, we will use the predict function, which requires the model and the covariates to predict the horsepower of the cars. To make this explicit, we will exclude the column of hp from the data.frame.

rf_prediction <- predict(rf_model, validation[,-4])
print(rf_prediction)
##     Hornet 4 Drive           Merc 280         Merc 450SL Cadillac Fleetwood 
##          123.43118          122.20713          173.63778          219.34986 
##      Toyota Corona          Fiat X1-9 
##           96.53314           74.71494

Random Forest

We can create a data.frame with the real values (Obs) and the Random Forest predictions (RF).

data.frame(Obs = validation[,4], RF = rf_prediction)
##                    Obs        RF
## Hornet 4 Drive     110 123.43118
## Merc 280           123 122.20713
## Merc 450SL         180 173.63778
## Cadillac Fleetwood 205 219.34986
## Toyota Corona       97  96.53314
## Fiat X1-9           66  74.71494
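To summarise the agreement in a single number, we could compute, for instance, the root-mean-square error (RMSE) from the table above (a sketch; RMSE is just one of several possible metrics):

```r
# Obs and RF values copied from the table above
obs  <- c(110, 123, 180, 205, 97, 66)
pred <- c(123.43118, 122.20713, 173.63778, 219.34986, 96.53314, 74.71494)

rmse <- sqrt(mean((obs - pred)^2))
print(round(rmse, 2))   # about 9.16
```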

Random Forest

The function varImpPlot will generate a dotchart of variable importance as measured by a Random Forest.

varImpPlot(rf_model)

Support Vector Machines

The svm function from the e1071 package is used to train support vector machines for classification and regression purposes.

svm(formula, data, kernel, degree, gamma, coef0, na.action)

formula: a symbolic description of the model to be fitted

data: a data frame containing the variables in the model

kernel: the kernel used in training and predicting: linear, polynomial, radial, and sigmoid

degree: parameter needed for kernel of type polynomial

Support Vector Machines

svm(formula, data, kernel, degree, gamma, coef0, na.action)

gamma: parameter needed for all kernels except linear

coef0: parameter needed for kernels of type polynomial and sigmoid

na.action: A function to specify the action to be taken if NAs are found

There are more parameters related to this function. Please type ?svm to see them.

Support Vector Machines

Similarly to RF, the formula that will be applied is \(hp\sim.\). Let’s use the different kernels.

svm_model_lin    <- svm(hp ~ ., data = training, kernel = "linear")
svm_model_poly   <- svm(hp ~ ., data = training, kernel = "polynomial")
svm_model_rad    <- svm(hp ~ ., data = training, kernel = "radial")
svm_model_sig    <- svm(hp ~ ., data = training, kernel = "sigmoid")

Now we can predict the hp values:

svm_prediction_lin  <- predict(svm_model_lin, validation[,-4])
svm_prediction_poly <- predict(svm_model_poly, validation[,-4])
svm_prediction_rad  <- predict(svm_model_rad, validation[,-4])
svm_prediction_sig  <- predict(svm_model_sig, validation[,-4])

Support Vector Machines

Finally, we can summarise the results.

data.frame(Obs = validation[,4], SVM_lin = svm_prediction_lin,
           SVM_poly = svm_prediction_poly, SVM_rad = svm_prediction_rad, 
           SVM_sig = svm_prediction_sig)
##                    Obs   SVM_lin  SVM_poly   SVM_rad   SVM_sig
## Hornet 4 Drive     110 121.51574 123.47778 106.47474 120.99467
## Merc 280           123 156.98698 146.04602 125.07476 142.84838
## Merc 450SL         180 156.65475 168.71280 171.96696 160.63893
## Cadillac Fleetwood 205 236.92939 223.57702 211.27667 226.52371
## Toyota Corona       97  45.72309  80.75757  91.49229  82.29314
## Fiat X1-9           66  69.46686  77.96054  77.80608  68.92851

Neural Networks

The neuralnet package allows training neural networks using backpropagation or resilient backpropagation (RPROP), with or without weight backtracking.

neuralnet(formula, data, hidden, threshold, stepmax, algorithm)

formula: a symbolic description of the model to be fitted

data: a data frame containing the variables in the model

hidden: a vector of integers specifying the number of hidden neurons (vertices) in each layer

Neural Networks

neuralnet(formula, data, hidden, threshold, stepmax, algorithm)

threshold: a numeric value specifying the threshold for the partial derivatives of the error function as stopping criteria

stepmax: the maximum steps for the training of the neural network

algorithm: a string containing the algorithm type to calculate the neural network. The following types are possible: ‘backprop’, ‘rprop+’, ‘rprop-’, ‘sag’, or ‘slr’

There are more parameters related to this function. Please type ?neuralnet to see them.

Neural Networks

The formula that will be applied is \(hp\sim.\).

nn_model <- neuralnet(hp ~ ., data = training)

Now we can predict the hp values:

nn_prediction  <- predict(nn_model, validation[,-4])

Neural Networks

Finally, we can summarise the results.

data.frame(Obs=validation[,4], NN=nn_prediction)
##                    Obs       NN
## Hornet 4 Drive     110 150.4998
## Merc 280           123 150.4998
## Merc 450SL         180 150.4998
## Cadillac Fleetwood 205 150.4998
## Toyota Corona       97 150.4998
## Fiat X1-9           66 150.4998
  • What happened?
  • The first thing to check when a neural network returns strange results is normalisation: if the input data are not normalised, training can saturate the activation functions and produce a skewed network that returns the same output for every input.

Neural Networks

Let’s normalise the data.

norm_df <- rbind(training, validation)

# define the min-max normalisation function
minmax_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

norm <- apply(norm_df, 2, minmax_norm)
head(norm)
##                         mpg cyl      disp        hp      drat        wt
## Chrysler Imperial 0.1829787 1.0 0.9201796 0.6289753 0.2165899 0.9798006
## Volvo 142E        0.4680851 0.0 0.1244699 0.2014134 0.6221198 0.3239581
## Merc 230          0.5276596 0.0 0.1738588 0.1519435 0.5345622 0.4185630
## Maserati Bora     0.1957447 1.0 0.5734597 1.0000000 0.3594470 0.5259524
## Mazda RX4 Wag     0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.3482485
## Merc 450SLC       0.2042553 1.0 0.5106011 0.4522968 0.1428571 0.5796471
##                         qsec vs am gear      carb
## Chrysler Imperial 0.34761905  0  0  0.0 0.4285714
## Volvo 142E        0.48809524  1  1  0.5 0.1428571
## Merc 230          1.00000000  1  0  0.5 0.1428571
## Maserati Bora     0.01190476  0  1  1.0 1.0000000
## Mazda RX4 Wag     0.30000000  0  1  0.5 0.4285714
## Merc 450SLC       0.41666667  0  0  0.0 0.2857143
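A quick sanity check of the min-max function on a small vector (repeated here so the snippet is self-contained): it maps the minimum to 0, the maximum to 1, and intermediate values proportionally.

```r
# Min-max normalisation, as defined above
minmax_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

minmax_norm(c(2, 4, 6))   # 0.0 0.5 1.0
```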

Neural Networks

Now, we can subset the data accordingly.

norm_training   <- norm[1:nrow(training),]
norm_validation <- norm[-c(1:nrow(training)),]

Finally, we can apply the NN model.

nn_model_log  <- neuralnet(hp~., data = norm_training, act.fct="logistic")
nn_model_tan  <- neuralnet(hp~., data = norm_training, act.fct="tanh")

Neural Networks

We can plot the NN.

plot(nn_model_log)

We can plot the NN in a nicer way.

plot(nn_model_log, col.hidden = 'darkgreen', col.hidden.synapse = 'darkgreen',
     show.weights = FALSE, information = FALSE, fill = 'lightblue')

Neural Networks

We can use the models to predict the variable.

nn_prediction_log  <- predict(nn_model_log, norm_validation[,-4])
nn_prediction_tan  <- predict(nn_model_tan, norm_validation[,-4])

To visualise the results:

data.frame(Obs = norm_validation[,4], NN_log = nn_prediction_log, NN_tan = nn_prediction_tan)
##                           Obs     NN_log       NN_tan
## Hornet 4 Drive     0.20494700 0.15848709  0.280470546
## Merc 280           0.25088339 0.30682250  0.338217726
## Merc 450SL         0.45229682 0.35768535  0.359986470
## Cadillac Fleetwood 0.54063604 0.61909455  0.705239030
## Toyota Corona      0.15901060 0.10892604 -0.003904399
## Fiat X1-9          0.04946996 0.08748578  0.041047591

Neural Networks

Finally, we can back-transform the predictions to the original hp scale:

hp_val_log <- nn_prediction_log * (max(norm_df$hp) - min(norm_df$hp)) + min(norm_df$hp)
hp_val_tan <- nn_prediction_tan * (max(norm_df$hp) - min(norm_df$hp)) + min(norm_df$hp)

data.frame(Obs = validation[,4], NN_log = hp_val_log, NN_tan = hp_val_tan)
##                    Obs    NN_log    NN_tan
## Hornet 4 Drive     110  96.85185 131.37316
## Merc 280           123 138.83077 147.71562
## Merc 450SL         180 153.22495 153.87617
## Cadillac Fleetwood 205 227.20376 251.58265
## Toyota Corona       97  82.82607  50.89506
## Fiat X1-9           66  76.75848  63.61647
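As a sketch with illustrative hp-like values, we can verify that this back-transformation is the exact inverse of the min-max scaling:

```r
# Made-up values: normalise, then back-transform
x <- c(52, 110, 180, 335)

x_norm <- (x - min(x)) / (max(x) - min(x))
x_back <- x_norm * (max(x) - min(x)) + min(x)

all.equal(x, x_back)   # TRUE: the original values are recovered
```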

Using machine learning for prediction

Finally, let’s create a summary of the results.

result <- data.frame(Obs = validation[,4], RF = rf_prediction, SVM_lin = svm_prediction_lin,
              SVM_poly = svm_prediction_poly, SVM_sig = svm_prediction_sig,
              NN_log = hp_val_log, NN_tan = hp_val_tan)
print(result)
##                    Obs        RF   SVM_lin  SVM_poly   SVM_sig    NN_log
## Hornet 4 Drive     110 123.43118 121.51574 123.47778 120.99467  96.85185
## Merc 280           123 122.20713 156.98698 146.04602 142.84838 138.83077
## Merc 450SL         180 173.63778 156.65475 168.71280 160.63893 153.22495
## Cadillac Fleetwood 205 219.34986 236.92939 223.57702 226.52371 227.20376
## Toyota Corona       97  96.53314  45.72309  80.75757  82.29314  82.82607
## Fiat X1-9           66  74.71494  69.46686  77.96054  68.92851  76.75848
##                       NN_tan
## Hornet 4 Drive     131.37316
## Merc 280           147.71562
## Merc 450SL         153.87617
## Cadillac Fleetwood 251.58265
## Toyota Corona       50.89506
## Fiat X1-9           63.61647

Exercise - machine learning for streamflow prediction

Machine learning for streamflow prediction

Now let’s do a more practical example. Imagine that we have two nested catchments and some missing streamflow data in the lower catchment. Let’s use machine learning to predict those missing values.

Streamflow prediction

To work with the time series, we will use the hydroTSM package. First, let’s read the data, which is stored as a zoo object.

library(hydroTSM)
q_path <- "C:/User/Data/L3_Machine_Learning_I/Streamflow_mm.zoo"
q <- read.zoo(q_path, header = TRUE)

Streamflow prediction

plot(q, main = "Streamflow [mm]")

Streamflow prediction

Let’s separate the data of the catchment and subcatchment accordingly.

subcatch <- q[,1]
catch    <- q[,2] 

As the data are complete in this case, we will create an additional object with the streamflow of the catchment and delete some years of data from it. This way, we will be able to compare the measured values with the predictions.

catch_na <- catch

Streamflow prediction

We can store the dates of the zoo in an object.

dates <- index(q)
head(dates)
## [1] "1990-01-01" "1990-01-02" "1990-01-03" "1990-01-04" "1990-01-05"
## [6] "1990-01-06"
tail(dates)
## [1] "2018-12-26" "2018-12-27" "2018-12-28" "2018-12-29" "2018-12-30"
## [6] "2018-12-31"

Now, we will set the first six years of catch_na to NA. For this purpose, we will create a pos object that stores the position of the date ‘1995-12-31’.

pos <- grep("1995-12-31", dates)
catch_na[1:pos] <- NA

Streamflow prediction

This way, the object catch keeps the complete record.

plot(catch, main = "Streamflow [mm]")

Streamflow prediction

The object catch_na, in contrast, has six years of missing values.

plot(catch_na, main = "Streamflow [mm]")

Streamflow prediction

We will partition the data that will be used for training and verification purposes using the pos object that we created earlier.

# The training set starts one day after "1995-12-31"
start <- pos + 1

subcatch_training <- subcatch[start:length(subcatch)]
catch_training    <- catch[start:length(catch)]

head(subcatch_training)
## 1990-01-01 1990-01-02 1990-01-03 1990-01-04 1990-01-05 1990-01-06 
##   4.540620   4.428097   4.269241   4.203052   4.090529   4.004482
head(catch_training)
## 1990-01-01 1990-01-02 1990-01-03 1990-01-04 1990-01-05 1990-01-06 
##   2.521102   2.452117   2.392539   2.389403   2.285925   2.220075

Streamflow prediction

training <- data.frame(subcatch = subcatch_training, catch = catch_training)
head(training)
##            subcatch    catch
## 1990-01-01 4.540620 2.521102
## 1990-01-02 4.428097 2.452117
## 1990-01-03 4.269241 2.392539
## 1990-01-04 4.203052 2.389403
## 1990-01-05 4.090529 2.285925
## 1990-01-06 4.004482 2.220075
tail(training)
##            subcatch    catch
## 2018-12-26 3.263157 1.859470
## 2018-12-27 3.203586 1.803027
## 2018-12-28 3.130777 1.774806
## 2018-12-29 3.177110 1.771670
## 2018-12-30 3.276395 1.909641
## 2018-12-31 3.005016 1.762263
summary(training)
##     subcatch           catch        
##  Min.   : 0.9664   Min.   : 0.4829  
##  1st Qu.: 2.7403   1st Qu.: 1.3672  
##  Median : 4.5208   Median : 2.8190  
##  Mean   : 5.7265   Mean   : 4.1321  
##  3rd Qu.: 7.2147   3rd Qu.: 5.7070  
##  Max.   :45.2076   Max.   :47.0982  
##  NA's   :123       NA's   :7

Streamflow prediction

During training, we can run into problems if there are missing data, so let’s exclude the missing values from the training set.

pos_nas <- which(is.na(training$subcatch))
pos_nas <- c(pos_nas, which(is.na(training$catch)))
print(pos_nas)
##   [1] 7396 7397 7398 7399 7400 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410
##  [16] 7411 7412 7413 7414 7415 7416 7417 7418 7419 7420 7421 7422 7423 7424 7425
##  [31] 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437 7438 7439 7440
##  [46] 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452 7453 7454 7455
##  [61] 7456 8188 8189 9442 9443 9444 9445 9446 9447 9448 9449 9450 9451 9453 9454
##  [76] 9455 9456 9457 9458 9460 9461 9462 9463 9464 9465 9466 9467 9468 9469 9470
##  [91] 9471 9472 9473 9474 9475 9476 9477 9478 9479 9480 9481 9482 9483 9484 9485
## [106] 9486 9487 9488 9489 9490 9491 9492 9493 9494 9495 9496 9497 9498 9499 9798
## [121] 9799 9800 9801 2237 2238 2239 2240 5537 6686 9831

training <- training[-pos_nas,]
summary(training)
##     subcatch           catch        
##  Min.   : 0.9664   Min.   : 0.4829  
##  1st Qu.: 2.7403   1st Qu.: 1.3609  
##  Median : 4.5274   Median : 2.8692  
##  Mean   : 5.7287   Mean   : 4.1585  
##  3rd Qu.: 7.2147   3rd Qu.: 5.7383  
##  Max.   :45.2076   Max.   :47.0982
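The same NA-exclusion logic can be sketched on a tiny illustrative data.frame:

```r
# Tiny made-up data.frame with NAs in both columns
df <- data.frame(subcatch = c(1.2, NA, 3.4, 5.6),
                 catch    = c(0.8, 2.1, NA, 4.2))

pos_nas <- which(is.na(df$subcatch))
pos_nas <- c(pos_nas, which(is.na(df$catch)))
print(pos_nas)          # 2 3

df_clean <- df[-pos_nas, ]
nrow(df_clean)          # 2 complete rows remain
```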

Streamflow prediction

The next step is to train the models. For this purpose, we will use Random Forest and Support Vector Machines with the different kernels.

rf_model         <- randomForest(catch ~ ., data = training)
svm_model_lin    <- svm(catch ~ ., data = training, kernel = "linear")
svm_model_poly   <- svm(catch ~ ., data = training, kernel = "polynomial")
svm_model_rad    <- svm(catch ~ ., data = training, kernel = "radial")
svm_model_sig    <- svm(catch ~ ., data = training, kernel = "sigmoid")

Streamflow prediction

Now, let’s prepare the data that will be used to predict the missing values of the catchment (i.e., the daily streamflow of the first six years), using the subcatchment streamflow as the covariate.

covariate <- data.frame(subcatch = subcatch[1:pos])
head(covariate)
##            subcatch
## 1990-01-01 4.540620
## 1990-01-02 4.428097
## 1990-01-03 4.269241
## 1990-01-04 4.203052
## 1990-01-05 4.090529
## 1990-01-06 4.004482
tail(covariate)
##            subcatch
## 1995-12-26 3.918436
## 1995-12-27 3.819151
## 1995-12-28 3.653676
## 1995-12-29 3.541154
## 1995-12-30 3.441869
## 1995-12-31 3.375679

Streamflow prediction

Finally, we can predict the streamflow values using the different models and the covariate object.

rf_prediction       <- predict(rf_model, covariate)
svm_prediction_lin  <- predict(svm_model_lin, covariate)
svm_prediction_poly <- predict(svm_model_poly, covariate)
svm_prediction_rad  <- predict(svm_model_rad, covariate)
svm_prediction_sig  <- predict(svm_model_sig, covariate)

Streamflow prediction

Now that we have the predicted values we can convert them to zoo objects.

rf_prediction       <- zoo(rf_prediction, dates[1:pos])
svm_prediction_lin  <- zoo(svm_prediction_lin, dates[1:pos])
svm_prediction_poly <- zoo(svm_prediction_poly, dates[1:pos])
svm_prediction_rad  <- zoo(svm_prediction_rad, dates[1:pos])
svm_prediction_sig  <- zoo(svm_prediction_sig, dates[1:pos])

Streamflow prediction

We can plot the results of the predictions as follows:

plot(catch_na)
lines(rf_prediction, col="green")
lines(svm_prediction_lin, col="blue")
lines(svm_prediction_poly, col="red")
lines(svm_prediction_rad, col="cyan")
lines(svm_prediction_sig, col="purple")

grid()
legend("topright", lwd = 1,
       c("Obs", "RF", "SVM-Linear", "SVM-Polynomial", "SVM-Radial", "SVM-Sigmoid"),
       col = c("black", "green", "blue", "red", "cyan", "purple"), bty = "n")

Streamflow prediction

We can plot the results of the predictions as follows:

Streamflow prediction

Removing the low-performing models:

plot(catch_na)
lines(rf_prediction, col="green")
lines(svm_prediction_lin, col="blue")
lines(svm_prediction_rad, col="cyan")

grid()
legend("topright", lwd = 1, c("Obs", "RF", "SVM-Linear", "SVM-Radial"),
       col = c("black", "green", "blue", "cyan"), bty = "n")

Streamflow prediction

We can plot the results of the predictions as follows:

Streamflow prediction

We can use a performance indicator to assess which model performed the best. The Kling-Gupta efficiency (KGE; Gupta et al., 2009, Kling et al., 2012) has three components: the Pearson correlation coefficient (\(r\)); the bias ratio (\(\beta\)); and the variability ratio (\(\gamma\)).

Streamflow prediction

\[\scriptsize \textrm{KGE'} = 1 - \sqrt{ (r-1)^2 + (\beta-1)^2 + (\gamma-1)^2 } \]

 

\[\scriptsize r = \frac{ \sum_{i=1}^n{(O_i - \bar{O}) (S_i - \bar{S}) } }{\sqrt{\sum_{i=1}^n{(O_i - \bar{O})^2}}\sqrt{\sum_{i=1}^n{(S_i - \bar{S})^2}}}; \beta = \frac{\mu_{s}}{\mu_{o}}; \gamma = \frac{CV_{s}}{CV_{o}} = \frac{\sigma_{s}/\mu_{s}}{\sigma_{o}/\mu_{o}} \]

\(\mu\) is the mean streamflow (\(Q\)), \(CV\) is the coefficient of variation, \(\sigma\) represents the standard deviation of \(Q\), and the subscripts \(s\) and \(o\) represent simulated and observed \(Q\), respectively. The KGE’ and its components have their optimum value at one, and its optimisation seeks to reproduce the temporal dynamics (measured by \(r\)), while preserving the volume and variability of \(Q\), measured by \(\beta\) and \(\gamma\), respectively.
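Following the formula above, a minimal base-R sketch of the KGE’ could look as follows (this is not the hydroGOF implementation used below, and it assumes complete series with no NAs):

```r
# Minimal KGE' sketch: r, beta and gamma exactly as defined in the text
kge <- function(obs, sim) {
  r     <- cor(obs, sim)                                  # temporal dynamics
  beta  <- mean(sim) / mean(obs)                          # bias ratio
  gamma <- (sd(sim) / mean(sim)) / (sd(obs) / mean(obs))  # variability ratio
  1 - sqrt((r - 1)^2 + (beta - 1)^2 + (gamma - 1)^2)
}

# a perfect simulation reaches the optimum value of 1
obs <- c(2.5, 3.1, 4.8, 2.2, 5.0)
kge(obs, obs)   # 1
```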

Streamflow prediction

We can create an object that stores the observations of the verification period.

obs <- catch[1:pos]

Finally, we can compute the KGE’ using the KGE function from the hydroGOF package.

library(hydroGOF)

Streamflow prediction

KGE(obs = obs, sim = rf_prediction, method = "2012", out.type = "full")
## $KGE.value
## [1] 0.9234464
## 
## $KGE.elements
##         r      Beta     Gamma 
## 0.9682697 1.0035594 0.9304229
KGE(obs = obs, sim = svm_prediction_lin, method = "2012", out.type = "full")
## $KGE.value
## [1] 0.9080331
## 
## $KGE.elements
##         r      Beta     Gamma 
## 0.9559886 0.9970294 0.9193025
KGE(obs = obs, sim = svm_prediction_poly, method = "2012", out.type = "full")
## $KGE.value
## [1] 0.08816593
## 
## $KGE.elements
##         r      Beta     Gamma 
## 0.6845051 0.7331579 1.8128343
KGE(obs = obs, sim = svm_prediction_rad, method = "2012", out.type = "full")
## $KGE.value
## [1] 0.8847093
## 
## $KGE.elements
##         r      Beta     Gamma 
## 0.9517110 0.9925498 0.8955749
KGE(obs = obs, sim = svm_prediction_sig, method = "2012", out.type = "full")
## $KGE.value
## [1] -90.43503
## 
## $KGE.elements
##           r        Beta       Gamma 
##  -0.7320031 -90.2282000  -4.8975169